Occupations
Step 1. Import the necessary libraries
Step 2. Import the dataset from this address
Step 3. Assign it to a variable called users
Step 4. Discover the mean age per occupation
Step 5. Discover the male ratio per occupation and sort it from highest to lowest
Step 6. For each occupation, calculate the minimum and maximum ages
Step 7. For each combination of occupation and sex, calculate the mean age
Step 8. For each occupation present the percentage of women and men
# Step 1 Import the necessary libraries
import pandas as pd
# Step 2 Import the dataset from this address
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user'
# Step 3 Assign it to a variable called users
users = pd.read_csv(url,sep='|',index_col='user_id')
users
| age | gender | occupation | zip_code | |
|---|---|---|---|---|
| user_id | ||||
| 1 | 24 | M | technician | 85711 |
| 2 | 53 | F | other | 94043 |
| 3 | 23 | M | writer | 32067 |
| 4 | 24 | M | technician | 43537 |
| 5 | 33 | F | other | 15213 |
| ... | ... | ... | ... | ... |
| 939 | 26 | F | student | 33319 |
| 940 | 32 | M | administrator | 02215 |
| 941 | 20 | M | student | 97229 |
| 942 | 48 | F | librarian | 78209 |
| 943 | 22 | M | student | 77841 |
943 rows × 4 columns
# Step 4 Discover the mean age per occupation
users.groupby('occupation')['age'].mean()
occupation
administrator    38.746835
artist           31.392857
doctor           43.571429
educator         42.010526
engineer         36.388060
entertainment    29.222222
executive        38.718750
healthcare       41.562500
homemaker        32.571429
lawyer           36.750000
librarian        40.000000
marketing        37.615385
none             26.555556
other            34.523810
programmer       33.121212
retired          63.071429
salesman         35.666667
scientist        35.548387
student          22.081633
technician       33.148148
writer           36.311111
Name: age, dtype: float64
# Step 5 Discover the male ratio per occupation and sort it from highest to lowest
def genders(g):
    if g == 'M':
        return 1
    else:
        return 0
users['gender_n'] = users['gender'].apply(genders)
x = users.groupby('occupation')['gender_n'].sum() / users.occupation.value_counts() * 100
x.sort_values(ascending=False)
doctor           100.000000
engineer          97.014925
technician        96.296296
retired           92.857143
programmer        90.909091
executive         90.625000
scientist         90.322581
entertainment     88.888889
lawyer            83.333333
salesman          75.000000
educator          72.631579
student           69.387755
other             65.714286
marketing         61.538462
writer            57.777778
none              55.555556
administrator     54.430380
artist            53.571429
librarian         43.137255
healthcare        31.250000
homemaker         14.285714
dtype: float64
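The helper-column detour above works, but a boolean comparison is already 1/0 under the hood, so its per-group mean is the male fraction directly. A minimal sketch on a tiny synthetic frame (not the real dataset):

```python
import pandas as pd

# Tiny synthetic stand-in for the users frame
df = pd.DataFrame({
    'occupation': ['tech', 'tech', 'writer', 'writer'],
    'gender':     ['M',    'F',    'M',      'M'],
})

# (gender == 'M') is a boolean Series; its per-group mean is the male fraction
male_ratio = (df['gender'] == 'M').groupby(df['occupation']).mean() * 100
male_ratio = male_ratio.sort_values(ascending=False)
```

On the real frame the equivalent is `(users['gender'] == 'M').groupby(users['occupation']).mean() * 100`, with no extra `gender_n` column needed.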
# Step 6 For each occupation, calculate the minimum and maximum ages
users.groupby('occupation')['age'].agg(['min','max'])
| min | max | |
|---|---|---|
| occupation | ||
| administrator | 21 | 70 |
| artist | 19 | 48 |
| doctor | 28 | 64 |
| educator | 23 | 63 |
| engineer | 22 | 70 |
| entertainment | 15 | 50 |
| executive | 22 | 69 |
| healthcare | 22 | 62 |
| homemaker | 20 | 50 |
| lawyer | 21 | 53 |
| librarian | 23 | 69 |
| marketing | 24 | 55 |
| none | 11 | 55 |
| other | 13 | 64 |
| programmer | 20 | 63 |
| retired | 51 | 73 |
| salesman | 18 | 66 |
| scientist | 23 | 55 |
| student | 7 | 42 |
| technician | 21 | 55 |
| writer | 18 | 60 |
# Step 7 For each combination of occupation and sex, calculate the mean age
users.groupby(['occupation','gender'])['age'].mean()
occupation gender
administrator F 40.638889
M 37.162791
artist F 30.307692
M 32.333333
doctor M 43.571429
educator F 39.115385
M 43.101449
engineer F 29.500000
M 36.600000
entertainment F 31.000000
M 29.000000
executive F 44.000000
M 38.172414
healthcare F 39.818182
M 45.400000
homemaker F 34.166667
M 23.000000
lawyer F 39.500000
M 36.200000
librarian F 40.000000
M 40.000000
marketing F 37.200000
M 37.875000
none F 36.500000
M 18.600000
other F 35.472222
M 34.028986
programmer F 32.166667
M 33.216667
retired F 70.000000
M 62.538462
salesman F 27.000000
M 38.555556
scientist F 28.333333
M 36.321429
student F 20.750000
M 22.669118
technician F 38.000000
M 32.961538
writer F 37.631579
M 35.346154
Name: age, dtype: float64
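The MultiIndex Series above can also be pivoted into a side-by-side table with `unstack()`, which makes the F/M comparison easier to scan. A sketch on synthetic data (not the real dataset):

```python
import pandas as pd

# Tiny synthetic stand-in for the users frame
df = pd.DataFrame({
    'occupation': ['tech', 'tech', 'writer'],
    'gender':     ['M',    'F',    'M'],
    'age':        [30,     40,     25],
})

# unstack() moves the inner index level (gender) into columns
wide = df.groupby(['occupation', 'gender'])['age'].mean().unstack()
```

Occupations with only one gender (like doctor in the real data) show NaN in the missing column.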
# Step 8 For each occupation present the percentage of women and men
a = users.groupby(['occupation','gender']).agg({'gender':'count'})
b = users.groupby('occupation').agg('count')
c = a.div(b,level='occupation') * 100
c.loc[:,'gender']
occupation gender
administrator F 45.569620
M 54.430380
artist F 46.428571
M 53.571429
doctor M 100.000000
educator F 27.368421
M 72.631579
engineer F 2.985075
M 97.014925
entertainment F 11.111111
M 88.888889
executive F 9.375000
M 90.625000
healthcare F 68.750000
M 31.250000
homemaker F 85.714286
M 14.285714
lawyer F 16.666667
M 83.333333
librarian F 56.862745
M 43.137255
marketing F 38.461538
M 61.538462
none F 44.444444
M 55.555556
other F 34.285714
M 65.714286
programmer F 9.090909
M 90.909091
retired F 7.142857
M 92.857143
salesman F 25.000000
M 75.000000
scientist F 9.677419
M 90.322581
student F 30.612245
M 69.387755
technician F 3.703704
M 96.296296
writer F 42.222222
M 57.777778
Name: gender, dtype: float64
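The count/divide dance in Step 8 can be shortened with the grouped `value_counts(normalize=True)`, which computes the within-group fractions in one call. A sketch on synthetic data:

```python
import pandas as pd

# Tiny synthetic stand-in for the users frame
df = pd.DataFrame({
    'occupation': ['tech', 'tech', 'tech', 'writer'],
    'gender':     ['M',    'M',    'F',    'F'],
})

# normalize=True turns the per-group counts into fractions
pct = df.groupby('occupation')['gender'].value_counts(normalize=True) * 100
```

The result is a Series indexed by (occupation, gender), the same shape as the Step 8 output above.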
Euro Teams
Step 1. Import the necessary libraries
Step 2. Import the dataset from this address
Step 3. Assign it to a variable called euro12
Step 4. Select only the Goal column
Step 5. How many teams participated in Euro 2012?
Step 6. What is the number of columns in the dataset?
Step 7. View only the columns Team, Yellow Cards and Red Cards and assign them to a dataframe called discipline
Step 8. Sort the teams by Red Cards, then by Yellow Cards
Step 9. Calculate the mean Yellow Cards given per Team
Step 10. Filter teams that scored more than 6 goals
Step 11. Select the teams that start with G
Step 12. Select the first 7 columns
Step 13. Select all columns except the last 3
Step 14. Present only the Shooting Accuracy from England, Italy and Russia
# Step 1 Import the necessary libraries
import pandas as pd
# Step 2 Import the dataset from this address
# Step 3 Assign it to a variable called euro12
euro12 = pd.read_csv('https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/02_Filtering_%26_Sorting/Euro12/Euro_2012_stats_TEAM.csv')
euro12
| Team | Goals | Shots on target | Shots off target | Shooting Accuracy | % Goals-to-shots | Total shots (inc. Blocked) | Hit Woodwork | Penalty goals | Penalties not scored | ... | Saves made | Saves-to-shots ratio | Fouls Won | Fouls Conceded | Offsides | Yellow Cards | Red Cards | Subs on | Subs off | Players Used | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Croatia | 4 | 13 | 12 | 51.9% | 16.0% | 32 | 0 | 0 | 0 | ... | 13 | 81.3% | 41 | 62 | 2 | 9 | 0 | 9 | 9 | 16 |
| 1 | Czech Republic | 4 | 13 | 18 | 41.9% | 12.9% | 39 | 0 | 0 | 0 | ... | 9 | 60.1% | 53 | 73 | 8 | 7 | 0 | 11 | 11 | 19 |
| 2 | Denmark | 4 | 10 | 10 | 50.0% | 20.0% | 27 | 1 | 0 | 0 | ... | 10 | 66.7% | 25 | 38 | 8 | 4 | 0 | 7 | 7 | 15 |
| 3 | England | 5 | 11 | 18 | 50.0% | 17.2% | 40 | 0 | 0 | 0 | ... | 22 | 88.1% | 43 | 45 | 6 | 5 | 0 | 11 | 11 | 16 |
| 4 | France | 3 | 22 | 24 | 37.9% | 6.5% | 65 | 1 | 0 | 0 | ... | 6 | 54.6% | 36 | 51 | 5 | 6 | 0 | 11 | 11 | 19 |
| 5 | Germany | 10 | 32 | 32 | 47.8% | 15.6% | 80 | 2 | 1 | 0 | ... | 10 | 62.6% | 63 | 49 | 12 | 4 | 0 | 15 | 15 | 17 |
| 6 | Greece | 5 | 8 | 18 | 30.7% | 19.2% | 32 | 1 | 1 | 1 | ... | 13 | 65.1% | 67 | 48 | 12 | 9 | 1 | 12 | 12 | 20 |
| 7 | Italy | 6 | 34 | 45 | 43.0% | 7.5% | 110 | 2 | 0 | 0 | ... | 20 | 74.1% | 101 | 89 | 16 | 16 | 0 | 18 | 18 | 19 |
| 8 | Netherlands | 2 | 12 | 36 | 25.0% | 4.1% | 60 | 2 | 0 | 0 | ... | 12 | 70.6% | 35 | 30 | 3 | 5 | 0 | 7 | 7 | 15 |
| 9 | Poland | 2 | 15 | 23 | 39.4% | 5.2% | 48 | 0 | 0 | 0 | ... | 6 | 66.7% | 48 | 56 | 3 | 7 | 1 | 7 | 7 | 17 |
| 10 | Portugal | 6 | 22 | 42 | 34.3% | 9.3% | 82 | 6 | 0 | 0 | ... | 10 | 71.5% | 73 | 90 | 10 | 12 | 0 | 14 | 14 | 16 |
| 11 | Republic of Ireland | 1 | 7 | 12 | 36.8% | 5.2% | 28 | 0 | 0 | 0 | ... | 17 | 65.4% | 43 | 51 | 11 | 6 | 1 | 10 | 10 | 17 |
| 12 | Russia | 5 | 9 | 31 | 22.5% | 12.5% | 59 | 2 | 0 | 0 | ... | 10 | 77.0% | 34 | 43 | 4 | 6 | 0 | 7 | 7 | 16 |
| 13 | Spain | 12 | 42 | 33 | 55.9% | 16.0% | 100 | 0 | 1 | 0 | ... | 15 | 93.8% | 102 | 83 | 19 | 11 | 0 | 17 | 17 | 18 |
| 14 | Sweden | 5 | 17 | 19 | 47.2% | 13.8% | 39 | 3 | 0 | 0 | ... | 8 | 61.6% | 35 | 51 | 7 | 7 | 0 | 9 | 9 | 18 |
| 15 | Ukraine | 2 | 7 | 26 | 21.2% | 6.0% | 38 | 0 | 0 | 0 | ... | 13 | 76.5% | 48 | 31 | 4 | 5 | 0 | 9 | 9 | 18 |
16 rows × 35 columns
# Step 4 Select only the Goal column
euro12['Goals']
0      4
1      4
2      4
3      5
4      3
5     10
6      5
7      6
8      2
9      2
10     6
11     1
12     5
13    12
14     5
15     2
Name: Goals, dtype: int64
# Step 5 How many teams participated in Euro 2012?
len(euro12['Team'].unique())
16
# Step 6 What is the number of columns in the dataset?
len(euro12.columns)
35
# Step 7 View only the columns Team, Yellow Cards and Red Cards and assign them to a dataframe called discipline
discipline=euro12[['Team','Yellow Cards','Red Cards']]
discipline
| Team | Yellow Cards | Red Cards | |
|---|---|---|---|
| 0 | Croatia | 9 | 0 |
| 1 | Czech Republic | 7 | 0 |
| 2 | Denmark | 4 | 0 |
| 3 | England | 5 | 0 |
| 4 | France | 6 | 0 |
| 5 | Germany | 4 | 0 |
| 6 | Greece | 9 | 1 |
| 7 | Italy | 16 | 0 |
| 8 | Netherlands | 5 | 0 |
| 9 | Poland | 7 | 1 |
| 10 | Portugal | 12 | 0 |
| 11 | Republic of Ireland | 6 | 1 |
| 12 | Russia | 6 | 0 |
| 13 | Spain | 11 | 0 |
| 14 | Sweden | 7 | 0 |
| 15 | Ukraine | 5 | 0 |
# Step 8 Sort the teams by Red Cards, then by Yellow Cards
discipline.sort_values(['Red Cards', 'Yellow Cards'], ascending=False)
| Team | Yellow Cards | Red Cards | |
|---|---|---|---|
| 6 | Greece | 9 | 1 |
| 9 | Poland | 7 | 1 |
| 11 | Republic of Ireland | 6 | 1 |
| 7 | Italy | 16 | 0 |
| 10 | Portugal | 12 | 0 |
| 13 | Spain | 11 | 0 |
| 0 | Croatia | 9 | 0 |
| 1 | Czech Republic | 7 | 0 |
| 14 | Sweden | 7 | 0 |
| 4 | France | 6 | 0 |
| 12 | Russia | 6 | 0 |
| 3 | England | 5 | 0 |
| 8 | Netherlands | 5 | 0 |
| 15 | Ukraine | 5 | 0 |
| 2 | Denmark | 4 | 0 |
| 5 | Germany | 4 | 0 |
# Step 9 Calculate the mean Yellow Cards given per Team
discipline.groupby('Team')['Yellow Cards'].mean()
Team
Croatia                 9
Czech Republic          7
Denmark                 4
England                 5
France                  6
Germany                 4
Greece                  9
Italy                  16
Netherlands             5
Poland                  7
Portugal               12
Republic of Ireland     6
Russia                  6
Spain                  11
Sweden                  7
Ukraine                 5
Name: Yellow Cards, dtype: int64
# Step 10 Filter teams that scored more than 6 goals
e = euro12[euro12.Goals > 6]
e[['Team','Goals']]
| Team | Goals | |
|---|---|---|
| 5 | Germany | 10 |
| 13 | Spain | 12 |
# Step 11 Select the teams that start with G
e = euro12[euro12.Team.str[0] == 'G']
e[['Team']]
| Team | |
|---|---|
| 5 | Germany |
| 6 | Greece |
# Step 12 Select the first 7 columns
euro12.iloc[: , 0:7]
| Team | Goals | Shots on target | Shots off target | Shooting Accuracy | % Goals-to-shots | Total shots (inc. Blocked) | |
|---|---|---|---|---|---|---|---|
| 0 | Croatia | 4 | 13 | 12 | 51.9% | 16.0% | 32 |
| 1 | Czech Republic | 4 | 13 | 18 | 41.9% | 12.9% | 39 |
| 2 | Denmark | 4 | 10 | 10 | 50.0% | 20.0% | 27 |
| 3 | England | 5 | 11 | 18 | 50.0% | 17.2% | 40 |
| 4 | France | 3 | 22 | 24 | 37.9% | 6.5% | 65 |
| 5 | Germany | 10 | 32 | 32 | 47.8% | 15.6% | 80 |
| 6 | Greece | 5 | 8 | 18 | 30.7% | 19.2% | 32 |
| 7 | Italy | 6 | 34 | 45 | 43.0% | 7.5% | 110 |
| 8 | Netherlands | 2 | 12 | 36 | 25.0% | 4.1% | 60 |
| 9 | Poland | 2 | 15 | 23 | 39.4% | 5.2% | 48 |
| 10 | Portugal | 6 | 22 | 42 | 34.3% | 9.3% | 82 |
| 11 | Republic of Ireland | 1 | 7 | 12 | 36.8% | 5.2% | 28 |
| 12 | Russia | 5 | 9 | 31 | 22.5% | 12.5% | 59 |
| 13 | Spain | 12 | 42 | 33 | 55.9% | 16.0% | 100 |
| 14 | Sweden | 5 | 17 | 19 | 47.2% | 13.8% | 39 |
| 15 | Ukraine | 2 | 7 | 26 | 21.2% | 6.0% | 38 |
# Step 13 Select all columns except the last 3
euro12.iloc[: , :-3]
| Team | Goals | Shots on target | Shots off target | Shooting Accuracy | % Goals-to-shots | Total shots (inc. Blocked) | Hit Woodwork | Penalty goals | Penalties not scored | ... | Clean Sheets | Blocks | Goals conceded | Saves made | Saves-to-shots ratio | Fouls Won | Fouls Conceded | Offsides | Yellow Cards | Red Cards | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Croatia | 4 | 13 | 12 | 51.9% | 16.0% | 32 | 0 | 0 | 0 | ... | 0 | 10 | 3 | 13 | 81.3% | 41 | 62 | 2 | 9 | 0 |
| 1 | Czech Republic | 4 | 13 | 18 | 41.9% | 12.9% | 39 | 0 | 0 | 0 | ... | 1 | 10 | 6 | 9 | 60.1% | 53 | 73 | 8 | 7 | 0 |
| 2 | Denmark | 4 | 10 | 10 | 50.0% | 20.0% | 27 | 1 | 0 | 0 | ... | 1 | 10 | 5 | 10 | 66.7% | 25 | 38 | 8 | 4 | 0 |
| 3 | England | 5 | 11 | 18 | 50.0% | 17.2% | 40 | 0 | 0 | 0 | ... | 2 | 29 | 3 | 22 | 88.1% | 43 | 45 | 6 | 5 | 0 |
| 4 | France | 3 | 22 | 24 | 37.9% | 6.5% | 65 | 1 | 0 | 0 | ... | 1 | 7 | 5 | 6 | 54.6% | 36 | 51 | 5 | 6 | 0 |
| 5 | Germany | 10 | 32 | 32 | 47.8% | 15.6% | 80 | 2 | 1 | 0 | ... | 1 | 11 | 6 | 10 | 62.6% | 63 | 49 | 12 | 4 | 0 |
| 6 | Greece | 5 | 8 | 18 | 30.7% | 19.2% | 32 | 1 | 1 | 1 | ... | 1 | 23 | 7 | 13 | 65.1% | 67 | 48 | 12 | 9 | 1 |
| 7 | Italy | 6 | 34 | 45 | 43.0% | 7.5% | 110 | 2 | 0 | 0 | ... | 2 | 18 | 7 | 20 | 74.1% | 101 | 89 | 16 | 16 | 0 |
| 8 | Netherlands | 2 | 12 | 36 | 25.0% | 4.1% | 60 | 2 | 0 | 0 | ... | 0 | 9 | 5 | 12 | 70.6% | 35 | 30 | 3 | 5 | 0 |
| 9 | Poland | 2 | 15 | 23 | 39.4% | 5.2% | 48 | 0 | 0 | 0 | ... | 0 | 8 | 3 | 6 | 66.7% | 48 | 56 | 3 | 7 | 1 |
| 10 | Portugal | 6 | 22 | 42 | 34.3% | 9.3% | 82 | 6 | 0 | 0 | ... | 2 | 11 | 4 | 10 | 71.5% | 73 | 90 | 10 | 12 | 0 |
| 11 | Republic of Ireland | 1 | 7 | 12 | 36.8% | 5.2% | 28 | 0 | 0 | 0 | ... | 0 | 23 | 9 | 17 | 65.4% | 43 | 51 | 11 | 6 | 1 |
| 12 | Russia | 5 | 9 | 31 | 22.5% | 12.5% | 59 | 2 | 0 | 0 | ... | 0 | 8 | 3 | 10 | 77.0% | 34 | 43 | 4 | 6 | 0 |
| 13 | Spain | 12 | 42 | 33 | 55.9% | 16.0% | 100 | 0 | 1 | 0 | ... | 5 | 8 | 1 | 15 | 93.8% | 102 | 83 | 19 | 11 | 0 |
| 14 | Sweden | 5 | 17 | 19 | 47.2% | 13.8% | 39 | 3 | 0 | 0 | ... | 1 | 12 | 5 | 8 | 61.6% | 35 | 51 | 7 | 7 | 0 |
| 15 | Ukraine | 2 | 7 | 26 | 21.2% | 6.0% | 38 | 0 | 0 | 0 | ... | 0 | 4 | 4 | 13 | 76.5% | 48 | 31 | 4 | 5 | 0 |
16 rows × 32 columns
# Step 14 Present only the Shooting Accuracy from England, Italy and Russia
euro12.loc[[3,7,12] , ['Team','Shooting Accuracy']]
| Team | Shooting Accuracy | |
|---|---|---|
| 3 | England | 50.0% |
| 7 | Italy | 43.0% |
| 12 | Russia | 22.5% |
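The positional labels 3, 7 and 12 in the `.loc` call above are specific to this CSV's row order; selecting by team name with `isin` is more robust. A sketch on a small synthetic frame (not the real dataset):

```python
import pandas as pd

# Small synthetic stand-in for euro12
euro = pd.DataFrame({
    'Team': ['England', 'France', 'Italy', 'Russia'],
    'Shooting Accuracy': ['50.0%', '37.9%', '43.0%', '22.5%'],
})

# Boolean mask on the Team column instead of hard-coded row labels
picked = euro.loc[euro['Team'].isin(['England', 'Italy', 'Russia']),
                  ['Team', 'Shooting Accuracy']]
```

On the real frame: `euro12.loc[euro12.Team.isin(['England', 'Italy', 'Russia']), ['Team', 'Shooting Accuracy']]`.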
Housing
Step 1. Import the necessary libraries
Step 2. Create 3 different Series, each of length 100, as follows:
• The first a random number from 1 to 4
• The second a random number from 1 to 3
• The third a random number from 10,000 to 30,000
Step 3. Create a DataFrame by joining the Series by column
Step 4. Change the name of the columns to bedrs, bathrs, price_sqr_meter
Step 5. Create a one column DataFrame with the values of the 3 Series and assign it to 'bigcolumn'
Step 6. Oops, it seems it is going only until index 99. Is it true?
Step 7. Reindex the DataFrame so it goes from 0 to 299
# Step 1 Import the necessary libraries
import pandas as pd
import numpy as np
import random
# Step 2 Create 3 different Series, each of length 100, as follows:
#The first a random number from 1 to 4
s1 = pd.Series(np.random.randint(1, high=5, size=100, dtype='l'))
s1
0 1
1 2
2 2
3 1
4 4
..
95 1
96 1
97 4
98 3
99 4
Length: 100, dtype: int32
#The second a random number from 1 to 3
s2 = pd.Series(np.random.randint(1, high=4, size=100, dtype='l'))
s2
0 3
1 3
2 3
3 2
4 1
..
95 2
96 1
97 3
98 2
99 2
Length: 100, dtype: int32
#The third a random number from 10,000 to 30,000
s3 = pd.Series(np.random.randint(10000, high=30001, size=100, dtype='l'))
s3
0 15665
1 10223
2 12305
3 16135
4 28406
...
95 13590
96 19569
97 10837
98 22781
99 12075
Length: 100, dtype: int32
# Step 3 Create a DataFrame by joining the Series by column
housings = pd.concat([s1, s2, s3], axis=1)
housings
| 0 | 1 | 2 | |
|---|---|---|---|
| 0 | 1 | 3 | 15665 |
| 1 | 2 | 3 | 10223 |
| 2 | 2 | 3 | 12305 |
| 3 | 1 | 2 | 16135 |
| 4 | 4 | 1 | 28406 |
| ... | ... | ... | ... |
| 95 | 1 | 2 | 13590 |
| 96 | 1 | 1 | 19569 |
| 97 | 4 | 3 | 10837 |
| 98 | 3 | 2 | 22781 |
| 99 | 4 | 2 | 12075 |
100 rows × 3 columns
# Step 4 Change the name of the columns to bedrs, bathrs, price_sqr_meter
housings.rename(columns = {0: 'bedrs', 1: 'bathrs', 2: 'price_sqr_meter'}, inplace=True)
housings
| bedrs | bathrs | price_sqr_meter | |
|---|---|---|---|
| 0 | 1 | 3 | 15665 |
| 1 | 2 | 3 | 10223 |
| 2 | 2 | 3 | 12305 |
| 3 | 1 | 2 | 16135 |
| 4 | 4 | 1 | 28406 |
| ... | ... | ... | ... |
| 95 | 1 | 2 | 13590 |
| 96 | 1 | 1 | 19569 |
| 97 | 4 | 3 | 10837 |
| 98 | 3 | 2 | 22781 |
| 99 | 4 | 2 | 12075 |
100 rows × 3 columns
# Step 5 Create a one column DataFrame with the values of the 3 Series and assign it to 'bigcolumn'
# concatenate the three Series end to end (Series.append was removed in pandas 2.0)
bigcolumn = pd.DataFrame(pd.concat([s1, s2, s3]))
bigcolumn
| 0 | |
|---|---|
| 0 | 1 |
| 1 | 2 |
| 2 | 2 |
| 3 | 1 |
| 4 | 4 |
| ... | ... |
| 95 | 13590 |
| 96 | 19569 |
| 97 | 10837 |
| 98 | 22781 |
| 99 | 12075 |
300 rows × 1 columns
# Step 6 Oops, it seems it is going only until index 99. Is it true?
bigcolumn.iloc[97:102]
# Yes, it is true: the index restarts at 0 after 99
| 0 | |
|---|---|
| 97 | 4 |
| 98 | 3 |
| 99 | 4 |
| 0 | 3 |
| 1 | 3 |
# Step 7 Reindex the DataFrame so it goes from 0 to 299
bigcolumn = bigcolumn.reset_index(drop=True)
bigcolumn
| 0 | |
|---|---|
| 0 | 1 |
| 1 | 2 |
| 2 | 2 |
| 3 | 1 |
| 4 | 4 |
| ... | ... |
| 295 | 13590 |
| 296 | 19569 |
| 297 | 10837 |
| 298 | 22781 |
| 299 | 12075 |
300 rows × 1 columns
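Steps 5–7 can also be collapsed into one call: `pd.concat` with `ignore_index=True` both stacks the Series and builds the fresh 0–299 index. A sketch, using the newer `numpy.random.default_rng` API instead of the legacy `np.random.randint`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s1 = pd.Series(rng.integers(1, 5, 100))          # 1 to 4
s2 = pd.Series(rng.integers(1, 4, 100))          # 1 to 3
s3 = pd.Series(rng.integers(10000, 30001, 100))  # 10,000 to 30,000

# ignore_index=True renumbers the stacked values 0..299 in one step
big = pd.concat([s1, s2, s3], ignore_index=True).to_frame('bigcolumn')
```

No `reset_index` pass is needed afterwards.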
Wind Statistics
The data have been modified to contain some missing values, identified by NaN.
Using pandas should make this exercise easier, in particular for the bonus question.
You should be able to perform all of these operations without using a for loop or other looping construct.
The data in 'wind.data' has the following format:
Yr Mo Dy RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL
61  1  1 15.04 14.96 13.17  9.29   NaN  9.87 13.67 10.25 10.83 12.58 18.50 15.04
61  1  2 14.71   NaN 10.83  6.50 12.62  7.67 11.50 10.04  9.79  9.67 17.54 13.83
61  1  3 18.50 16.88 12.33 10.13 11.17  6.17 11.25   NaN  8.50  7.67 12.75 12.71
The first three columns are year, month, and day. The remaining 12 columns are
average windspeeds in knots at 12 locations in Ireland on that day.
Step 1. Import the necessary libraries
Step 2. Import the dataset from this address
Step 3. Assign it to a variable called data and replace the first 3 columns by a proper datetime index.
Step 4. Year 2061? Do we really have data from this year? Create a function to fix it and apply it.
Step 5. Set the right dates as the index. Pay attention to the data type, it should be datetime64[ns].
Step 6. Compute how many values are missing for each location over the entire record. They should be ignored in all calculations below.
Step 7. Compute how many non-missing values there are in total.
Step 8. Calculate the mean windspeeds of the windspeeds over all the locations and all the times.
A single number for the entire dataset.
Step 9. Create a DataFrame called loc_stats and calculate the min, max and mean windspeeds and standard deviations of the windspeeds at each location over all the days
A different set of numbers for each location.
Step 10. Create a DataFrame called day_stats and calculate the min, max and mean windspeed and standard deviations of the windspeeds across all the locations at each day.
A different set of numbers for each day.
Step 11. Find the average windspeed in January for each location. Treat January 1961 and January 1962 both as January.
Step 12. Downsample the record to a yearly frequency for each location.
Step 13. Downsample the record to a monthly frequency for each location.
Step 14. Downsample the record to a weekly frequency for each location.
Step 15. Calculate the min, max and mean windspeeds and standard deviations of the windspeeds across all locations for each week (assume that the first week starts on January 2 1961) for the first 52 weeks.
# Step 1 Import the necessary libraries
import numpy as np
import pandas as pd
import datetime
# Step 2 Import the dataset from this address
url="https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/Wind_Stats/wind.data"
# Step 3 Assign it to a variable called data and replace the first 3 columns by a proper datetime index
data = pd.read_csv(url, sep=r"\s+", parse_dates=[[0, 1, 2]])
data
| Yr_Mo_Dy | RPT | VAL | ROS | KIL | SHA | BIR | DUB | CLA | MUL | CLO | BEL | MAL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2061-01-01 | 15.04 | 14.96 | 13.17 | 9.29 | NaN | 9.87 | 13.67 | 10.25 | 10.83 | 12.58 | 18.50 | 15.04 |
| 1 | 2061-01-02 | 14.71 | NaN | 10.83 | 6.50 | 12.62 | 7.67 | 11.50 | 10.04 | 9.79 | 9.67 | 17.54 | 13.83 |
| 2 | 2061-01-03 | 18.50 | 16.88 | 12.33 | 10.13 | 11.17 | 6.17 | 11.25 | NaN | 8.50 | 7.67 | 12.75 | 12.71 |
| 3 | 2061-01-04 | 10.58 | 6.63 | 11.75 | 4.58 | 4.54 | 2.88 | 8.63 | 1.79 | 5.83 | 5.88 | 5.46 | 10.88 |
| 4 | 2061-01-05 | 13.33 | 13.25 | 11.42 | 6.17 | 10.71 | 8.21 | 11.92 | 6.54 | 10.92 | 10.34 | 12.92 | 11.83 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6569 | 1978-12-27 | 17.58 | 16.96 | 17.62 | 8.08 | 13.21 | 11.67 | 14.46 | 15.59 | 14.04 | 14.00 | 17.21 | 40.08 |
| 6570 | 1978-12-28 | 13.21 | 5.46 | 13.46 | 5.00 | 8.12 | 9.42 | 14.33 | 16.25 | 15.25 | 18.05 | 21.79 | 41.46 |
| 6571 | 1978-12-29 | 14.00 | 10.29 | 14.42 | 8.71 | 9.71 | 10.54 | 19.17 | 12.46 | 14.50 | 16.42 | 18.88 | 29.58 |
| 6572 | 1978-12-30 | 18.50 | 14.04 | 21.29 | 9.13 | 12.75 | 9.71 | 18.08 | 12.87 | 12.46 | 12.12 | 14.67 | 28.79 |
| 6573 | 1978-12-31 | 20.33 | 17.41 | 27.29 | 9.59 | 12.08 | 10.13 | 19.25 | 11.63 | 11.58 | 11.38 | 12.08 | 22.08 |
6574 rows × 13 columns
# Step 4 Year 2061? Do we really have data from this year? Create a function to fix it and apply it
def fix_century(x):
    year = x.year - 100 if x.year > 1979 else x.year
    return datetime.date(year, x.month, x.day)
data['Yr_Mo_Dy']=data['Yr_Mo_Dy'].apply(fix_century)
data.head()
| Yr_Mo_Dy | RPT | VAL | ROS | KIL | SHA | BIR | DUB | CLA | MUL | CLO | BEL | MAL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1961-01-01 | 15.04 | 14.96 | 13.17 | 9.29 | NaN | 9.87 | 13.67 | 10.25 | 10.83 | 12.58 | 18.50 | 15.04 |
| 1 | 1961-01-02 | 14.71 | NaN | 10.83 | 6.50 | 12.62 | 7.67 | 11.50 | 10.04 | 9.79 | 9.67 | 17.54 | 13.83 |
| 2 | 1961-01-03 | 18.50 | 16.88 | 12.33 | 10.13 | 11.17 | 6.17 | 11.25 | NaN | 8.50 | 7.67 | 12.75 | 12.71 |
| 3 | 1961-01-04 | 10.58 | 6.63 | 11.75 | 4.58 | 4.54 | 2.88 | 8.63 | 1.79 | 5.83 | 5.88 | 5.46 | 10.88 |
| 4 | 1961-01-05 | 13.33 | 13.25 | 11.42 | 6.17 | 10.71 | 8.21 | 11.92 | 6.54 | 10.92 | 10.34 | 12.92 | 11.83 |
# Step 5 Set the right dates as the index. Pay attention to the data type, it should be datetime64[ns]
data['Yr_Mo_Dy']=pd.to_datetime(data['Yr_Mo_Dy'])
data=data.set_index('Yr_Mo_Dy')
data.head()
| RPT | VAL | ROS | KIL | SHA | BIR | DUB | CLA | MUL | CLO | BEL | MAL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Yr_Mo_Dy | ||||||||||||
| 1961-01-01 | 15.04 | 14.96 | 13.17 | 9.29 | NaN | 9.87 | 13.67 | 10.25 | 10.83 | 12.58 | 18.50 | 15.04 |
| 1961-01-02 | 14.71 | NaN | 10.83 | 6.50 | 12.62 | 7.67 | 11.50 | 10.04 | 9.79 | 9.67 | 17.54 | 13.83 |
| 1961-01-03 | 18.50 | 16.88 | 12.33 | 10.13 | 11.17 | 6.17 | 11.25 | NaN | 8.50 | 7.67 | 12.75 | 12.71 |
| 1961-01-04 | 10.58 | 6.63 | 11.75 | 4.58 | 4.54 | 2.88 | 8.63 | 1.79 | 5.83 | 5.88 | 5.46 | 10.88 |
| 1961-01-05 | 13.33 | 13.25 | 11.42 | 6.17 | 10.71 | 8.21 | 11.92 | 6.54 | 10.92 | 10.34 | 12.92 | 11.83 |
# Step 6 Compute how many values are missing for each location over the entire record
#They should be ignored in all calculations below.
data.isnull().sum()
RPT    6
VAL    3
ROS    2
KIL    5
SHA    2
BIR    0
DUB    3
CLA    2
MUL    3
CLO    1
BEL    0
MAL    4
dtype: int64
# Step 7 Compute how many non-missing values there are in total
data.notnull().sum()
RPT    6568
VAL    6571
ROS    6572
KIL    6569
SHA    6572
BIR    6574
DUB    6571
CLA    6572
MUL    6571
CLO    6573
BEL    6574
MAL    6570
dtype: int64
# Step 8 Calculate the mean windspeeds of the windspeeds over all the locations and all the times
data.mean().mean()
10.227982360836924
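A caveat on `data.mean().mean()`: it averages the per-location means, which is not quite the mean of every reading when locations have different numbers of missing values. `stack()` pools all non-missing readings first. A sketch showing the difference on toy data:

```python
import numpy as np
import pandas as pd

# Column A has a missing value, so the two averaging orders disagree
df = pd.DataFrame({'A': [1.0, 3.0, np.nan], 'B': [10.0, 10.0, 10.0]})

mean_of_means = df.mean().mean()  # mean of {2.0, 10.0} = 6.0
pooled_mean = df.stack().mean()   # mean of {1, 3, 10, 10, 10} = 6.8
```

Both are defensible answers to Step 8; the point is to know which one you are computing.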
# Step 9 Create a DataFrame called loc_stats and calculate the min, max and mean windspeeds and standard deviations of the windspeeds at each location over all the days
loc_stats = pd.DataFrame({
'min': data.min(),
'max': data.max(),
'mean': data.mean(),
'std': data.std()
})
loc_stats
| min | max | mean | std | |
|---|---|---|---|---|
| RPT | 0.67 | 35.80 | 12.362987 | 5.618413 |
| VAL | 0.21 | 33.37 | 10.644314 | 5.267356 |
| ROS | 1.50 | 33.84 | 11.660526 | 5.008450 |
| KIL | 0.00 | 28.46 | 6.306468 | 3.605811 |
| SHA | 0.13 | 37.54 | 10.455834 | 4.936125 |
| BIR | 0.00 | 26.16 | 7.092254 | 3.968683 |
| DUB | 0.00 | 30.37 | 9.797343 | 4.977555 |
| CLA | 0.00 | 31.08 | 8.495053 | 4.499449 |
| MUL | 0.00 | 25.88 | 8.493590 | 4.166872 |
| CLO | 0.04 | 28.21 | 8.707332 | 4.503954 |
| BEL | 0.13 | 42.38 | 13.121007 | 5.835037 |
| MAL | 0.67 | 42.54 | 15.599079 | 6.699794 |
data.describe()
| RPT | VAL | ROS | KIL | SHA | BIR | DUB | CLA | MUL | CLO | BEL | MAL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 6568.000000 | 6571.000000 | 6572.000000 | 6569.000000 | 6572.000000 | 6574.000000 | 6571.000000 | 6572.000000 | 6571.000000 | 6573.000000 | 6574.000000 | 6570.000000 |
| mean | 12.362987 | 10.644314 | 11.660526 | 6.306468 | 10.455834 | 7.092254 | 9.797343 | 8.495053 | 8.493590 | 8.707332 | 13.121007 | 15.599079 |
| std | 5.618413 | 5.267356 | 5.008450 | 3.605811 | 4.936125 | 3.968683 | 4.977555 | 4.499449 | 4.166872 | 4.503954 | 5.835037 | 6.699794 |
| min | 0.670000 | 0.210000 | 1.500000 | 0.000000 | 0.130000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.040000 | 0.130000 | 0.670000 |
| 25% | 8.120000 | 6.670000 | 8.000000 | 3.580000 | 6.750000 | 4.000000 | 6.000000 | 5.090000 | 5.370000 | 5.330000 | 8.710000 | 10.710000 |
| 50% | 11.710000 | 10.170000 | 10.920000 | 5.750000 | 9.960000 | 6.830000 | 9.210000 | 8.080000 | 8.170000 | 8.290000 | 12.500000 | 15.000000 |
| 75% | 15.920000 | 14.040000 | 14.670000 | 8.420000 | 13.540000 | 9.670000 | 12.960000 | 11.420000 | 11.190000 | 11.630000 | 16.880000 | 19.830000 |
| max | 35.800000 | 33.370000 | 33.840000 | 28.460000 | 37.540000 | 26.160000 | 30.370000 | 31.080000 | 25.880000 | 28.210000 | 42.380000 | 42.540000 |
# Step 10 Create a DataFrame called day_stats and calculate the min, max and mean windspeed and standard deviations of the windspeeds across all the locations at each day.
day_stats = pd.DataFrame({
'min': data.min(axis=1),
'max': data.max(axis=1),
'mean': data.mean(axis=1),
'std': data.std(axis=1)
})
day_stats
| min | max | mean | std | |
|---|---|---|---|---|
| Yr_Mo_Dy | ||||
| 1961-01-01 | 9.29 | 18.50 | 13.018182 | 2.808875 |
| 1961-01-02 | 6.50 | 17.54 | 11.336364 | 3.188994 |
| 1961-01-03 | 6.17 | 18.50 | 11.641818 | 3.681912 |
| 1961-01-04 | 1.79 | 11.75 | 6.619167 | 3.198126 |
| 1961-01-05 | 6.17 | 13.33 | 10.630000 | 2.445356 |
| ... | ... | ... | ... | ... |
| 1978-12-27 | 8.08 | 40.08 | 16.708333 | 7.868076 |
| 1978-12-28 | 5.00 | 41.46 | 15.150000 | 9.687857 |
| 1978-12-29 | 8.71 | 29.58 | 14.890000 | 5.756836 |
| 1978-12-30 | 9.13 | 28.79 | 15.367500 | 5.540437 |
| 1978-12-31 | 9.59 | 27.29 | 15.402500 | 5.702483 |
6574 rows × 4 columns
# Step 11 Find the average windspeed in January for each location. Treat January 1961 and January 1962 both as January
data['date'] = data.index
data['month'] = data['date'].apply(lambda date: date.month)
data['year'] = data['date'].apply(lambda date: date.year)
data['day'] = data['date'].apply(lambda date: date.day)
jan_winds=data.query('month==1')
jan_winds.loc[:,'RPT':'MAL'].mean()
RPT    14.847325
VAL    12.914560
ROS    13.299624
KIL     7.199498
SHA    11.667734
BIR     8.054839
DUB    11.819355
CLA     9.512047
MUL     9.543208
CLO    10.053566
BEL    14.550520
MAL    18.028763
dtype: float64
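The three helper columns are not strictly necessary: a DatetimeIndex exposes `.month` directly, so January can be masked in one line. A sketch on a tiny synthetic frame (not the real dataset):

```python
import pandas as pd

idx = pd.to_datetime(['1961-01-15', '1961-02-10', '1962-01-05'])
wind = pd.DataFrame({'RPT': [10.0, 20.0, 14.0]}, index=idx)

# index.month ignores the year, so Jan 1961 and Jan 1962 pool together
jan_mean = wind[wind.index.month == 1].mean()
```

On the real frame: `data[data.index.month == 1].loc[:, 'RPT':'MAL'].mean()`.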
# Step 12 Downsample the record to a yearly frequency for each location.
# Note: this keeps only the January 1 reading of each year; a true downsample
# would aggregate, e.g. data.loc[:, 'RPT':'MAL'].resample('A').mean()
data.query('month == 1 and day == 1')
| RPT | VAL | ROS | KIL | SHA | BIR | DUB | CLA | MUL | CLO | BEL | MAL | date | month | year | day | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Yr_Mo_Dy | ||||||||||||||||
| 1961-01-01 | 15.04 | 14.96 | 13.17 | 9.29 | NaN | 9.87 | 13.67 | 10.25 | 10.83 | 12.58 | 18.50 | 15.04 | 1961-01-01 | 1 | 1961 | 1 |
| 1962-01-01 | 9.29 | 3.42 | 11.54 | 3.50 | 2.21 | 1.96 | 10.41 | 2.79 | 3.54 | 5.17 | 4.38 | 7.92 | 1962-01-01 | 1 | 1962 | 1 |
| 1963-01-01 | 15.59 | 13.62 | 19.79 | 8.38 | 12.25 | 10.00 | 23.45 | 15.71 | 13.59 | 14.37 | 17.58 | 34.13 | 1963-01-01 | 1 | 1963 | 1 |
| 1964-01-01 | 25.80 | 22.13 | 18.21 | 13.25 | 21.29 | 14.79 | 14.12 | 19.58 | 13.25 | 16.75 | 28.96 | 21.00 | 1964-01-01 | 1 | 1964 | 1 |
| 1965-01-01 | 9.54 | 11.92 | 9.00 | 4.38 | 6.08 | 5.21 | 10.25 | 6.08 | 5.71 | 8.63 | 12.04 | 17.41 | 1965-01-01 | 1 | 1965 | 1 |
| 1966-01-01 | 22.04 | 21.50 | 17.08 | 12.75 | 22.17 | 15.59 | 21.79 | 18.12 | 16.66 | 17.83 | 28.33 | 23.79 | 1966-01-01 | 1 | 1966 | 1 |
| 1967-01-01 | 6.46 | 4.46 | 6.50 | 3.21 | 6.67 | 3.79 | 11.38 | 3.83 | 7.71 | 9.08 | 10.67 | 20.91 | 1967-01-01 | 1 | 1967 | 1 |
| 1968-01-01 | 30.04 | 17.88 | 16.25 | 16.25 | 21.79 | 12.54 | 18.16 | 16.62 | 18.75 | 17.62 | 22.25 | 27.29 | 1968-01-01 | 1 | 1968 | 1 |
| 1969-01-01 | 6.13 | 1.63 | 5.41 | 1.08 | 2.54 | 1.00 | 8.50 | 2.42 | 4.58 | 6.34 | 9.17 | 16.71 | 1969-01-01 | 1 | 1969 | 1 |
| 1970-01-01 | 9.59 | 2.96 | 11.79 | 3.42 | 6.13 | 4.08 | 9.00 | 4.46 | 7.29 | 3.50 | 7.33 | 13.00 | 1970-01-01 | 1 | 1970 | 1 |
| 1971-01-01 | 3.71 | 0.79 | 4.71 | 0.17 | 1.42 | 1.04 | 4.63 | 0.75 | 1.54 | 1.08 | 4.21 | 9.54 | 1971-01-01 | 1 | 1971 | 1 |
| 1972-01-01 | 9.29 | 3.63 | 14.54 | 4.25 | 6.75 | 4.42 | 13.00 | 5.33 | 10.04 | 8.54 | 8.71 | 19.17 | 1972-01-01 | 1 | 1972 | 1 |
| 1973-01-01 | 16.50 | 15.92 | 14.62 | 7.41 | 8.29 | 11.21 | 13.54 | 7.79 | 10.46 | 10.79 | 13.37 | 9.71 | 1973-01-01 | 1 | 1973 | 1 |
| 1974-01-01 | 23.21 | 16.54 | 16.08 | 9.75 | 15.83 | 11.46 | 9.54 | 13.54 | 13.83 | 16.66 | 17.21 | 25.29 | 1974-01-01 | 1 | 1974 | 1 |
| 1975-01-01 | 14.04 | 13.54 | 11.29 | 5.46 | 12.58 | 5.58 | 8.12 | 8.96 | 9.29 | 5.17 | 7.71 | 11.63 | 1975-01-01 | 1 | 1975 | 1 |
| 1976-01-01 | 18.34 | 17.67 | 14.83 | 8.00 | 16.62 | 10.13 | 13.17 | 9.04 | 13.13 | 5.75 | 11.38 | 14.96 | 1976-01-01 | 1 | 1976 | 1 |
| 1977-01-01 | 20.04 | 11.92 | 20.25 | 9.13 | 9.29 | 8.04 | 10.75 | 5.88 | 9.00 | 9.00 | 14.88 | 25.70 | 1977-01-01 | 1 | 1977 | 1 |
| 1978-01-01 | 8.33 | 7.12 | 7.71 | 3.54 | 8.50 | 7.50 | 14.71 | 10.00 | 11.83 | 10.00 | 15.09 | 20.46 | 1978-01-01 | 1 | 1978 | 1 |
# Step 13 Downsample the record to a monthly frequency for each location
# Note: this keeps only the first day of each month; a true downsample
# would aggregate, e.g. data.loc[:, 'RPT':'MAL'].resample('M').mean()
data.query('day == 1')
| RPT | VAL | ROS | KIL | SHA | BIR | DUB | CLA | MUL | CLO | BEL | MAL | date | month | year | day | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Yr_Mo_Dy | ||||||||||||||||
| 1961-01-01 | 15.04 | 14.96 | 13.17 | 9.29 | NaN | 9.87 | 13.67 | 10.25 | 10.83 | 12.58 | 18.50 | 15.04 | 1961-01-01 | 1 | 1961 | 1 |
| 1961-02-01 | 14.25 | 15.12 | 9.04 | 5.88 | 12.08 | 7.17 | 10.17 | 3.63 | 6.50 | 5.50 | 9.17 | 8.00 | 1961-02-01 | 2 | 1961 | 1 |
| 1961-03-01 | 12.67 | 13.13 | 11.79 | 6.42 | 9.79 | 8.54 | 10.25 | 13.29 | NaN | 12.21 | 20.62 | NaN | 1961-03-01 | 3 | 1961 | 1 |
| 1961-04-01 | 8.38 | 6.34 | 8.33 | 6.75 | 9.33 | 9.54 | 11.67 | 8.21 | 11.21 | 6.46 | 11.96 | 7.17 | 1961-04-01 | 4 | 1961 | 1 |
| 1961-05-01 | 15.87 | 13.88 | 15.37 | 9.79 | 13.46 | 10.17 | 9.96 | 14.04 | 9.75 | 9.92 | 18.63 | 11.12 | 1961-05-01 | 5 | 1961 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1978-08-01 | 19.33 | 15.09 | 20.17 | 8.83 | 12.62 | 10.41 | 9.33 | 12.33 | 9.50 | 9.92 | 15.75 | 18.00 | 1978-08-01 | 8 | 1978 | 1 |
| 1978-09-01 | 8.42 | 6.13 | 9.87 | 5.25 | 3.21 | 5.71 | 7.25 | 3.50 | 7.33 | 6.50 | 7.62 | 15.96 | 1978-09-01 | 9 | 1978 | 1 |
| 1978-10-01 | 9.50 | 6.83 | 10.50 | 3.88 | 6.13 | 4.58 | 4.21 | 6.50 | 6.38 | 6.54 | 10.63 | 14.09 | 1978-10-01 | 10 | 1978 | 1 |
| 1978-11-01 | 13.59 | 16.75 | 11.25 | 7.08 | 11.04 | 8.33 | 8.17 | 11.29 | 10.75 | 11.25 | 23.13 | 25.00 | 1978-11-01 | 11 | 1978 | 1 |
| 1978-12-01 | 21.29 | 16.29 | 24.04 | 12.79 | 18.21 | 19.29 | 21.54 | 17.21 | 16.71 | 17.83 | 17.75 | 25.70 | 1978-12-01 | 12 | 1978 | 1 |
216 rows × 16 columns
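Note that `query('day == 1')` merely samples the first reading of each month rather than aggregating the whole month. A true downsample to monthly frequency averages every day in each month; a minimal sketch, using a hypothetical daily frame standing in for the wind `data`:

```python
import pandas as pd
import numpy as np

# Hypothetical stand-in for the wind `data` frame: one column on a daily DatetimeIndex.
idx = pd.date_range('1961-01-01', '1961-02-28', freq='D')
data = pd.DataFrame({'RPT': np.arange(len(idx), dtype=float)}, index=idx)

# Downsample to month-start frequency, averaging all days in each month.
monthly = data.resample('MS').mean()
print(monthly)
```

With `resample('MS')` each row is labelled with the first day of its month, matching the `1961-01-01`-style index shown above, but its values are monthly means rather than first-of-month samples.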
# Step 15 Calculate the min, max, mean and standard deviation of the windspeeds
# across all locations for each week (assume the first week starts on
# January 2 1961) for the first 52 weeks
df_1961 = data[data.index < pd.to_datetime('1962-01-01')]
df_1961.resample('W').mean()
df_1961.resample('W').min()
df_1961.resample('W').max()
df_1961.resample('W').std()  # only this last expression is displayed below
| RPT | VAL | ROS | KIL | SHA | BIR | DUB | CLA | MUL | CLO | BEL | MAL | month | year | day | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Yr_Mo_Dy | |||||||||||||||
| 1961-01-01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1961-01-08 | 2.631321 | 3.949525 | 1.604761 | 1.810743 | 3.251660 | 2.059546 | 1.872222 | 3.098404 | 1.722255 | 1.704941 | 4.349139 | 1.773062 | 0.000000 | 0.0 | 2.160247 |
| 1961-01-15 | 3.555392 | 3.148945 | 5.034959 | 3.549559 | 3.471726 | 3.251039 | 4.709309 | 3.936894 | 3.500975 | 4.084293 | 5.552215 | 4.692355 | 0.000000 | 0.0 | 2.160247 |
| 1961-01-22 | 5.337402 | 3.837785 | 5.086229 | 6.245541 | 3.612875 | 3.453432 | 5.166300 | 3.164990 | 4.169112 | 4.783952 | 3.626584 | 4.237239 | 0.000000 | 0.0 | 2.160247 |
| 1961-01-29 | 4.619061 | 5.170224 | 4.665843 | 4.301325 | 4.858116 | 3.749415 | 4.508449 | 4.436222 | 4.902057 | 3.713368 | 5.210726 | 3.874721 | 0.000000 | 0.0 | 2.160247 |
| 1961-02-05 | 5.251408 | 5.187395 | 3.975166 | 2.709106 | 2.334619 | 2.397066 | 2.423454 | 4.081158 | 2.802490 | 2.839501 | 4.210858 | 4.336104 | 0.487950 | 0.0 | 13.483676 |
| 1961-02-12 | 3.587677 | 3.608373 | 3.290303 | 2.262056 | 5.571108 | 3.048976 | 2.974059 | 3.022753 | 2.914760 | 1.746749 | 4.063753 | 1.828705 | 0.000000 | 0.0 | 2.160247 |
| 1961-02-19 | 5.064609 | 3.575012 | 4.196621 | 4.311569 | 2.321716 | 3.024078 | 4.958631 | 2.283444 | 2.560591 | 2.531361 | 5.910938 | 4.685377 | 0.000000 | 0.0 | 2.160247 |
| 1961-02-26 | 7.020716 | 5.147348 | 5.578470 | 4.482075 | 6.480712 | 5.029874 | 6.037916 | 4.869668 | 4.705163 | 4.920064 | 5.091162 | 6.182283 | 0.000000 | 0.0 | 2.160247 |
| 1961-03-05 | 0.997721 | 2.851955 | 1.796871 | 1.652572 | 2.957129 | 2.022247 | 3.338177 | 2.877395 | 2.610124 | 1.593685 | 4.332331 | 3.021387 | 0.487950 | 0.0 | 12.027746 |
| 1961-03-12 | 3.732263 | 3.230167 | 3.592909 | 2.609928 | 3.110857 | 3.440419 | 5.537269 | 3.760557 | 3.657690 | 3.655113 | 4.358759 | 5.769890 | 0.000000 | 0.0 | 2.160247 |
| 1961-03-19 | 3.860036 | 2.352867 | 2.939244 | 2.416746 | 2.654289 | 2.045448 | 4.046593 | 3.373421 | 3.301880 | 3.099472 | 3.779727 | 4.331958 | 0.000000 | 0.0 | 2.160247 |
| 1961-03-26 | 3.613298 | 3.657265 | 4.041121 | 2.410127 | 2.927188 | 2.285729 | 2.892270 | 2.789682 | 2.469432 | 2.538224 | 4.318069 | 3.701846 | 0.000000 | 0.0 | 2.160247 |
| 1961-04-02 | 5.046922 | 4.687315 | 3.678996 | 3.625488 | 5.657679 | 4.054639 | 4.047068 | 4.689504 | 4.025317 | 3.191115 | 4.179854 | 3.924555 | 0.487950 | 0.0 | 13.483676 |
| 1961-04-09 | 4.604392 | 2.845399 | 2.902991 | 2.854052 | 2.743895 | 2.525046 | 2.109769 | 1.996609 | 1.935058 | 2.336182 | 3.147781 | 2.598271 | 0.000000 | 0.0 | 2.160247 |
| 1961-04-16 | 3.937727 | 2.607118 | 4.585752 | 2.664416 | 1.941836 | 2.388107 | 1.970667 | 2.379446 | 2.123279 | 2.161137 | 3.641464 | 2.747842 | 0.000000 | 0.0 | 2.160247 |
| 1961-04-23 | 5.676655 | 4.631736 | 5.456290 | 3.835403 | 4.122437 | 3.835998 | 3.254663 | 3.935731 | 3.656682 | 3.347972 | 4.735096 | 5.908542 | 0.000000 | 0.0 | 2.160247 |
| 1961-04-30 | 4.349662 | 2.871425 | 3.732776 | 1.897616 | 3.259907 | 2.528436 | 3.565777 | 3.426255 | 2.718926 | 2.840568 | 2.948237 | 5.108365 | 0.000000 | 0.0 | 2.160247 |
| 1961-05-07 | 5.025507 | 3.750835 | 4.301778 | 3.141495 | 5.781916 | 3.324038 | 4.419008 | 4.461950 | 3.558856 | 3.620819 | 8.003490 | 7.728504 | 0.000000 | 0.0 | 2.160247 |
| 1961-05-14 | 3.371022 | 3.782947 | 3.002225 | 3.509695 | 4.813159 | 4.660173 | 5.744586 | 4.690361 | 5.126826 | 5.460237 | 3.968272 | 7.858246 | 0.000000 | 0.0 | 2.160247 |
| 1961-05-21 | 3.631730 | 2.468906 | 3.767505 | 0.668791 | 2.541271 | 1.377071 | 1.650089 | 2.221636 | 1.613885 | 2.216889 | 1.975853 | 3.310819 | 0.000000 | 0.0 | 2.160247 |
| 1961-05-28 | 2.739433 | 3.378537 | 6.355768 | 1.378803 | 2.945134 | 1.873381 | 2.350104 | 3.207147 | 1.990891 | 2.575661 | 3.024524 | 3.811818 | 0.000000 | 0.0 | 2.160247 |
| 1961-06-04 | 3.099701 | 1.868125 | 3.474607 | 2.323826 | 2.305695 | 1.332009 | 2.300028 | 0.915184 | 1.954609 | 2.096989 | 2.611139 | 2.593586 | 0.534522 | 0.0 | 14.738999 |
| 1961-06-11 | 2.248597 | 1.524836 | 1.887475 | 1.603775 | 2.238228 | 2.204087 | 2.514689 | 2.857527 | 1.695090 | 2.158323 | 3.993062 | 4.925055 | 0.000000 | 0.0 | 2.160247 |
| 1961-06-18 | 3.009482 | 3.509444 | 3.429057 | 2.130352 | 4.175947 | 3.292713 | 3.417989 | 4.858023 | 2.900464 | 3.792400 | 6.477887 | 6.242673 | 0.000000 | 0.0 | 2.160247 |
| 1961-06-25 | 1.982035 | 2.212460 | 1.916454 | 1.332234 | 1.422401 | 1.448318 | 2.455385 | 2.109358 | 2.042257 | 2.286218 | 2.498386 | 3.063011 | 0.000000 | 0.0 | 2.160247 |
| 1961-07-02 | 2.557856 | 2.902411 | 1.161629 | 1.911018 | 2.172125 | 1.626363 | 1.927673 | 3.065467 | 1.745125 | 1.564144 | 6.303747 | 3.652313 | 0.487950 | 0.0 | 12.998168 |
| 1961-07-09 | 3.664855 | 2.686658 | 2.995919 | 2.898890 | 3.661383 | 2.459639 | 3.493425 | 3.160424 | 3.453210 | 3.657179 | 4.537988 | 3.665705 | 0.000000 | 0.0 | 2.160247 |
| 1961-07-16 | 5.168710 | 3.849630 | 2.166206 | 2.665625 | 4.596567 | 3.164715 | 4.489523 | 4.083629 | 3.978635 | 3.271899 | 4.971060 | 4.974273 | 0.000000 | 0.0 | 2.160247 |
| 1961-07-23 | 1.047978 | 1.400010 | 2.783208 | 1.158908 | 0.764956 | 0.617815 | 1.757905 | 1.479152 | 0.873207 | 1.439785 | 2.050218 | 2.133994 | 0.000000 | 0.0 | 2.160247 |
| 1961-07-30 | 4.157641 | 3.203206 | 3.276465 | 2.843207 | 4.514505 | 3.786633 | 5.110195 | 4.060478 | 4.504844 | 4.350268 | 5.580903 | 6.664574 | 0.000000 | 0.0 | 2.160247 |
| 1961-08-06 | 2.950887 | 3.985226 | 3.483894 | 1.647275 | 4.189524 | 2.642246 | 3.709319 | 3.921508 | 3.217501 | 2.903018 | 4.901377 | 4.448251 | 0.377964 | 0.0 | 10.533394 |
| 1961-08-13 | 4.422268 | 2.053326 | 2.174407 | 1.921080 | 3.071872 | 2.045195 | 3.073187 | 2.088368 | 2.718228 | 2.073777 | 2.931302 | 3.356585 | 0.000000 | 0.0 | 2.160247 |
| 1961-08-20 | 2.283635 | 2.523416 | 1.766039 | 1.938109 | 2.568539 | 2.227127 | 3.197566 | 2.972809 | 2.339544 | 2.730237 | 4.086725 | 3.934238 | 0.000000 | 0.0 | 2.160247 |
| 1961-08-27 | 3.395857 | 3.174702 | 3.071403 | 2.455711 | 4.380624 | 3.416186 | 4.170083 | 4.900671 | 4.120185 | 3.855302 | 6.711322 | 4.947608 | 0.000000 | 0.0 | 2.160247 |
| 1961-09-03 | 4.398615 | 7.474025 | 4.395383 | 3.790030 | 6.128522 | 4.455552 | 3.920110 | 5.012759 | 4.155931 | 3.993736 | 7.678051 | 6.308087 | 0.534522 | 0.0 | 14.738999 |
| 1961-09-10 | 5.207278 | 4.003996 | 3.922643 | 3.151502 | 3.178791 | 2.965546 | 4.258696 | 3.074692 | 3.767708 | 3.649278 | 4.220584 | 6.049619 | 0.000000 | 0.0 | 2.160247 |
| 1961-09-17 | 7.679190 | 5.360585 | 6.638947 | 5.560509 | 6.279010 | 4.706530 | 4.235755 | 4.565163 | 4.847073 | 5.128338 | 4.464252 | 6.332885 | 0.000000 | 0.0 | 2.160247 |
| 1961-09-24 | 1.267399 | 3.445262 | 2.322229 | 1.609614 | 3.125219 | 2.293031 | 1.444916 | 3.342247 | 1.523142 | 2.354092 | 5.235868 | 3.113507 | 0.000000 | 0.0 | 2.160247 |
| 1961-10-01 | 4.559572 | 2.812482 | 4.201062 | 2.050553 | 3.353967 | 1.901786 | 2.098203 | 3.425963 | 1.790306 | 3.908397 | 5.091268 | 4.696504 | 0.377964 | 0.0 | 10.160615 |
| 1961-10-08 | 5.596710 | 5.060803 | 4.394234 | 3.249050 | 3.830935 | 3.686641 | 3.420894 | 3.313201 | 3.717239 | 4.296870 | 4.800403 | 5.462002 | 0.000000 | 0.0 | 2.160247 |
| 1961-10-15 | 4.780675 | 2.707483 | 5.979099 | 2.367850 | 2.246657 | 1.969085 | 3.409743 | 2.506401 | 2.947025 | 3.569308 | 4.113200 | 4.098130 | 0.000000 | 0.0 | 2.160247 |
| 1961-10-22 | 7.888314 | 5.998199 | 5.463782 | 4.989763 | 6.095112 | 3.801637 | 5.334910 | 5.199929 | 3.785875 | 5.890511 | 5.645871 | 7.468377 | 0.000000 | 0.0 | 2.160247 |
| 1961-10-29 | 7.957637 | 6.879973 | 7.428776 | 5.576503 | 7.344744 | 5.101188 | 4.542684 | 5.506544 | 4.448500 | 6.277629 | 7.056150 | 8.340881 | 0.000000 | 0.0 | 2.160247 |
| 1961-11-05 | 3.369201 | 3.900278 | 2.461109 | 2.448363 | 3.664806 | 2.513051 | 2.727816 | 3.599111 | 3.003274 | 2.784450 | 4.038493 | 3.870800 | 0.487950 | 0.0 | 13.483676 |
| 1961-11-12 | 3.939811 | 2.141191 | 6.779554 | 2.930324 | 2.090641 | 1.414308 | 2.388943 | 2.181840 | 2.153102 | 2.843518 | 2.532196 | 3.690752 | 0.000000 | 0.0 | 2.160247 |
| 1961-11-19 | 2.784358 | 3.208548 | 9.415716 | 4.176374 | 2.092809 | 3.440953 | 7.027159 | 3.452202 | 3.488521 | 4.402588 | 3.643285 | 3.787654 | 0.000000 | 0.0 | 2.160247 |
| 1961-11-26 | 3.214368 | 3.277904 | 3.622254 | 1.824938 | 2.132751 | 1.822767 | 3.551685 | 2.276663 | 2.392934 | 2.747452 | 5.407223 | 6.475867 | 0.000000 | 0.0 | 2.160247 |
| 1961-12-03 | 5.704669 | 5.107089 | 4.163650 | 4.097123 | 5.536334 | 3.534980 | 4.186505 | 3.744471 | 3.658796 | 3.680477 | 5.552648 | 5.233192 | 0.534522 | 0.0 | 14.205968 |
| 1961-12-10 | 4.890152 | 4.115506 | 4.682044 | 3.782631 | 3.345941 | 3.481252 | 4.357862 | 4.503161 | 3.452864 | 4.156207 | 4.667933 | 7.345893 | 0.000000 | 0.0 | 2.160247 |
| 1961-12-17 | 4.095106 | 3.587886 | 3.843380 | 3.501489 | 3.943048 | 3.312448 | 6.474478 | 3.482027 | 4.333020 | 4.633398 | 6.531043 | 5.665006 | 0.000000 | 0.0 | 2.160247 |
| 1961-12-24 | 4.959717 | 2.220866 | 5.653229 | 5.341288 | 4.898600 | 4.019469 | 5.724775 | 4.134367 | 4.243508 | 4.637096 | 5.065308 | 5.048035 | 0.000000 | 0.0 | 2.160247 |
| 1961-12-31 | 5.787783 | 4.566479 | 9.739918 | 4.167851 | 3.018856 | 2.674370 | 5.142213 | 4.289240 | 3.325214 | 3.526625 | 3.262217 | 3.012729 | 0.000000 | 0.0 | 2.160247 |
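The four separate `resample` calls above display only the last result (the standard deviations). A single `agg` call collects all four statistics per week in one pass; a sketch against a hypothetical one-column daily frame covering the 52 weeks starting Monday, January 2 1961:

```python
import pandas as pd
import numpy as np

# Hypothetical daily frame standing in for the 1961 wind data.
idx = pd.date_range('1961-01-02', periods=364, freq='D')  # 52 full weeks
data = pd.DataFrame({'RPT': np.ones(len(idx))}, index=idx)

# One pass: min, max, mean and std of each column for every week.
# 'W' bins end on Sunday, which matches weeks starting Monday Jan 2 1961.
weekly = data.resample('W').agg(['min', 'max', 'mean', 'std']).head(52)
print(weekly)
```

The result has a two-level column index (`('RPT', 'min')`, `('RPT', 'max')`, ...), so all four statistics sit side by side instead of in four separate frames.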
Step 1. Import the necessary libraries
Step 2. Import the dataset from this address.
Step 3. Assign it to a variable called chipo.
Step 4. See the first 10 entries
Step 5. What is the number of observations in the dataset?
Step 6. What is the number of columns in the dataset?
Step 7. Print the name of all the columns.
Step 8. How is the dataset indexed?
Step 9. Which was the most-ordered item?
Step 10. For the most-ordered item, how many items were ordered?
Step 11. What was the most ordered item in the choice_description column?
Step 12. How many items were ordered in total?
Step 13.
• Turn the item price into a float
• Check the item price type
• Create a lambda function and change the type of item price
• Check the item price type
Step 14. How much was the revenue for the period in the dataset?
Step 15. How many orders were made in the period?
Step 16. What is the average revenue amount per order?
Step 17. How many different items are sold?
# Step 1 Import the necessary libraries
import pandas as pd
# Step 2 Import the dataset from this address.
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv'
# Step 3 Assign it to a variable called chipo
chipo = pd.read_csv(url, sep='\t')
chipo
| order_id | quantity | item_name | choice_description | item_price | |
|---|---|---|---|---|---|
| 0 | 1 | 1 | Chips and Fresh Tomato Salsa | NaN | $2.39 |
| 1 | 1 | 1 | Izze | [Clementine] | $3.39 |
| 2 | 1 | 1 | Nantucket Nectar | [Apple] | $3.39 |
| 3 | 1 | 1 | Chips and Tomatillo-Green Chili Salsa | NaN | $2.39 |
| 4 | 2 | 2 | Chicken Bowl | [Tomatillo-Red Chili Salsa (Hot), [Black Beans... | $16.98 |
| ... | ... | ... | ... | ... | ... |
| 4617 | 1833 | 1 | Steak Burrito | [Fresh Tomato Salsa, [Rice, Black Beans, Sour ... | $11.75 |
| 4618 | 1833 | 1 | Steak Burrito | [Fresh Tomato Salsa, [Rice, Sour Cream, Cheese... | $11.75 |
| 4619 | 1834 | 1 | Chicken Salad Bowl | [Fresh Tomato Salsa, [Fajita Vegetables, Pinto... | $11.25 |
| 4620 | 1834 | 1 | Chicken Salad Bowl | [Fresh Tomato Salsa, [Fajita Vegetables, Lettu... | $8.75 |
| 4621 | 1834 | 1 | Chicken Salad Bowl | [Fresh Tomato Salsa, [Fajita Vegetables, Pinto... | $8.75 |
4622 rows × 5 columns
#Step 4 See the first 10 entries
chipo.head(10)
| order_id | quantity | item_name | choice_description | item_price | |
|---|---|---|---|---|---|
| 0 | 1 | 1 | Chips and Fresh Tomato Salsa | NaN | $2.39 |
| 1 | 1 | 1 | Izze | [Clementine] | $3.39 |
| 2 | 1 | 1 | Nantucket Nectar | [Apple] | $3.39 |
| 3 | 1 | 1 | Chips and Tomatillo-Green Chili Salsa | NaN | $2.39 |
| 4 | 2 | 2 | Chicken Bowl | [Tomatillo-Red Chili Salsa (Hot), [Black Beans... | $16.98 |
| 5 | 3 | 1 | Chicken Bowl | [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou... | $10.98 |
| 6 | 3 | 1 | Side of Chips | NaN | $1.69 |
| 7 | 4 | 1 | Steak Burrito | [Tomatillo Red Chili Salsa, [Fajita Vegetables... | $11.75 |
| 8 | 4 | 1 | Steak Soft Tacos | [Tomatillo Green Chili Salsa, [Pinto Beans, Ch... | $9.25 |
| 9 | 5 | 1 | Steak Burrito | [Fresh Tomato Salsa, [Rice, Black Beans, Pinto... | $9.25 |
# Step 5 What is the number of observations in the dataset?
chipo.shape[0]
4622
# Step 6 What is the number of columns in the dataset?
chipo.shape[1]
5
#step 5 & 6 Display number of rows & columns
chipo.shape
(4622, 5)
# Step 7 Print the name of all the columns
chipo.columns
Index(['order_id', 'quantity', 'item_name', 'choice_description',
'item_price'],
dtype='object')
# Step 8 How is the dataset indexed?
chipo.index
RangeIndex(start=0, stop=4622, step=1)
# Step 9 & 10 Which was the most-ordered item, and how many times was it ordered?
chipo_grp = chipo.groupby('item_name')['quantity']
chip = chipo_grp.sum()
print('The most-ordered item is {} and has been ordered {} times.'.format(chip.idxmax(), chip.max()))
The most-ordered item is Chicken Bowl and has been ordered 761 times.
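The same answer can also be read off in one chained expression with `nlargest`; a sketch on a hypothetical miniature of the `chipo` frame:

```python
import pandas as pd

# Hypothetical miniature of the chipo frame.
chipo = pd.DataFrame({
    'item_name': ['Chicken Bowl', 'Izze', 'Chicken Bowl', 'Side of Chips'],
    'quantity': [2, 1, 3, 1],
})

# Sum quantities per item, then keep only the single largest.
top = chipo.groupby('item_name')['quantity'].sum().nlargest(1)
print(top)
```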
# Step 11 What was the most ordered item in the choice_description column?
chipo_desc_grp = chipo.groupby('choice_description')['quantity']
chip = chipo_desc_grp.sum()
print('The most ordered item in the choice_description column is {}'.format(chip.idxmax()))
print('And it has been ordered {} times'.format(chip.max()))
The most ordered item in the choice_description column is [Diet Coke] And it has been ordered 159 times
# Step 12 How many items were ordered in total?
chipo.quantity.sum()
4972
# Step 13
# Turn the item price into a float
def num_float(x):
    # Drop the leading '$' and the trailing space, then convert to float.
    return float(x[1:-1])
chipo.item_price = chipo.item_price.apply(num_float)
chipo
| order_id | quantity | item_name | choice_description | item_price | |
|---|---|---|---|---|---|
| 0 | 1 | 1 | Chips and Fresh Tomato Salsa | NaN | 2.39 |
| 1 | 1 | 1 | Izze | [Clementine] | 3.39 |
| 2 | 1 | 1 | Nantucket Nectar | [Apple] | 3.39 |
| 3 | 1 | 1 | Chips and Tomatillo-Green Chili Salsa | NaN | 2.39 |
| 4 | 2 | 2 | Chicken Bowl | [Tomatillo-Red Chili Salsa (Hot), [Black Beans... | 16.98 |
| ... | ... | ... | ... | ... | ... |
| 4617 | 1833 | 1 | Steak Burrito | [Fresh Tomato Salsa, [Rice, Black Beans, Sour ... | 11.75 |
| 4618 | 1833 | 1 | Steak Burrito | [Fresh Tomato Salsa, [Rice, Sour Cream, Cheese... | 11.75 |
| 4619 | 1834 | 1 | Chicken Salad Bowl | [Fresh Tomato Salsa, [Fajita Vegetables, Pinto... | 11.25 |
| 4620 | 1834 | 1 | Chicken Salad Bowl | [Fresh Tomato Salsa, [Fajita Vegetables, Lettu... | 8.75 |
| 4621 | 1834 | 1 | Chicken Salad Bowl | [Fresh Tomato Salsa, [Fajita Vegetables, Pinto... | 8.75 |
4622 rows × 5 columns
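Step 13 also asks for a lambda version of this conversion. Note that the slice `x[1:-1]` silently drops the last digit if a price string ever lacks the trailing space, so a stripping variant is safer; a sketch on hypothetical raw price strings as they appear in the TSV:

```python
import pandas as pd

# Hypothetical raw prices: leading '$', usually (but not always) a trailing space.
prices = pd.Series(['$2.39 ', '$16.98 ', '$11.25'])

# Lambda version: strip whitespace, drop the '$', convert to float.
as_float = prices.apply(lambda x: float(x.strip().lstrip('$')))
print(as_float)
```

Unlike `x[1:-1]`, this parses `'$11.25'` (no trailing space) as `11.25` rather than `11.2`.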
# Step 14 How much was the revenue for the period in the dataset?
# (item_price appears to already be the line total — quantity-2 rows show
# double the unit price — so no multiplication by quantity is needed)
chipo.item_price.sum()
34500.16
# Step 15 How many orders were made in the period?
chipo.order_id.value_counts().count()  # equivalently: chipo.order_id.nunique()
1834
# Step 16 What is the average revenue amount per order?
# (averaging 'quantity' here would give the mean number of items per
# order, not revenue — sum item_price per order, then take the mean)
chipo.groupby('order_id')['item_price'].sum().mean()  # ≈ 18.81
# Step 17 What are the three most expensive items?
chipo['unit_price'] = (chipo['item_price'] / chipo['quantity'])
new = chipo[['item_name', 'choice_description', 'unit_price']]
new.sort_values(by='unit_price', ascending=False)  # the top three rows answer the question
| item_name | choice_description | unit_price | |
|---|---|---|---|
| 2442 | Barbacoa Salad Bowl | [Tomatillo Green Chili Salsa, [Rice, Fajita Ve... | 11.89 |
| 2957 | Steak Salad Bowl | [Fresh Tomato Salsa, [Black Beans, Cheese, Gua... | 11.89 |
| 613 | Steak Salad Bowl | [Tomatillo Red Chili Salsa, [Fajita Vegetables... | 11.89 |
| 2401 | Steak Salad Bowl | [Fresh Tomato Salsa, [Fajita Vegetables, Guaca... | 11.89 |
| 281 | Steak Salad Bowl | [Tomatillo Red Chili Salsa, [Black Beans, Chee... | 11.89 |
| ... | ... | ... | ... |
| 378 | Canned Soda | [Mountain Dew] | 1.09 |
| 1805 | Canned Soda | [Sprite] | 1.09 |
| 2520 | Canned Soda | [Coca Cola] | 1.09 |
| 2981 | Bottled Water | NaN | 1.09 |
| 646 | Canned Soda | [Sprite] | 1.09 |
4622 rows × 3 columns
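The question list's Step 17 ("How many different items are sold?") is not answered above; `nunique` on `item_name` gives it directly. A sketch on a hypothetical miniature of the frame:

```python
import pandas as pd

# Hypothetical miniature of the chipo frame.
chipo = pd.DataFrame({'item_name': ['Chicken Bowl', 'Izze', 'Chicken Bowl']})

# Count the distinct menu items sold.
n_items = chipo['item_name'].nunique()
print(n_items)
```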
Create a line plot showing the number of marriages and divorces per capita in the U.S. between 1867 and 2014. Label both lines and show the legend. Don't forget to label your axes!
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('us-marriages-divorces-1867-2014.csv')
# Plot the per-capita rates (the Marriages/Divorces columns are raw counts)
df.plot('Year', y=['Marriages_per_1000', 'Divorces_per_1000'], color=['purple', 'red'], figsize=(15, 10))
plt.xlabel('Year', fontsize=14, color='blue')
plt.ylabel('Rate per 1,000 people', fontsize=14, color='blue')
plt.legend(['Marriages', 'Divorces'])
plt.show()
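If the per-1000 columns were missing, they could be derived from the raw counts and the population; a sketch using the 1900 row's values from the table below:

```python
import pandas as pd

# One row of the marriages/divorces CSV (values as shown for 1900).
df = pd.DataFrame({
    'Year': [1900],
    'Marriages': [709000.0],
    'Population': [76094000],
})

# Per-1000-people rate, as in the Marriages_per_1000 column.
df['Marriages_per_1000'] = df['Marriages'] / df['Population'] * 1000
print(df)
```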
Create a vertical bar chart comparing the number of marriages and divorces per capita in the U.S. between 1900, 1950, and 2000. Don't forget to label your axes!
# Combine both conditions on df directly to avoid the boolean-reindex warning
df2 = df[(df['Year'] >= 1900) & (df['Year'] <= 2000)]
df2
| Year | Marriages | Divorces | Population | Marriages_per_1000 | Divorces_per_1000 | |
|---|---|---|---|---|---|---|
| 33 | 1900 | 709000.0 | 56000.0 | 76094000 | 9.3 | 0.7 |
| 34 | 1901 | 742000.0 | 61000.0 | 77584000 | 9.6 | 0.8 |
| 35 | 1902 | 776000.0 | 61000.0 | 79163000 | 9.8 | 0.8 |
| 36 | 1903 | 818000.0 | 65000.0 | 80632000 | 10.1 | 0.8 |
| 37 | 1904 | 815000.0 | 66000.0 | 82166000 | 9.9 | 0.8 |
| ... | ... | ... | ... | ... | ... | ... |
| 129 | 1996 | NaN | NaN | 269714000 | NaN | NaN |
| 130 | 1997 | 2384000.0 | 1163000.0 | 272958000 | 8.7 | 4.3 |
| 131 | 1998 | 2244000.0 | 1135000.0 | 276154000 | 8.1 | 4.1 |
| 132 | 1999 | 2358000.0 | 1145245.0 | 279328000 | 8.4 | 4.1 |
| 133 | 2000 | 2315000.0 | 944000.0 | 282398000 | 8.2 | 3.3 |
101 rows × 6 columns
# Compare just the three requested years with side-by-side bars
years = [1900, 1950, 2000]
df3 = df2[df2['Year'].isin(years)]
x = range(len(years))
width = 0.35
fig = plt.figure(figsize=(10, 8))
plt.bar([i - width / 2 for i in x], df3['Marriages_per_1000'], width, label='Marriages per 1000', color='brown')
plt.bar([i + width / 2 for i in x], df3['Divorces_per_1000'], width, label='Divorces per 1000', color='grey')
plt.xticks(list(x), years)
plt.legend()
# The following commands add labels to our figure.
plt.xlabel('Year', fontsize=12, color='blue')
plt.ylabel('Rate per 1,000 people', fontsize=12, color='blue')
plt.title('Marriages and Divorces per capita in the U.S. in 1900, 1950 and 2000', fontsize=12, color='green')
plt.show()
Create a horizontal bar chart that compares the deadliest actors in Hollywood. Sort the actors by their kill count and label each bar with the corresponding actor's name. Don't forget to label your axes!
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.read_csv("actor_kill_counts.csv")
# Sort by kill count so the bars appear in order
df = df.sort_values('Count')
fig = plt.figure(figsize=(10, 8))
# A single explicit color avoids the deprecated 'rgb' shorthand warning
plt.barh(df['Actor'], df['Count'], color='steelblue')
plt.ylabel("Actor", fontsize=12, color='red')
plt.xlabel("Kill count", fontsize=12, color='red')
plt.title("Actor's kill count", fontsize=15, color='blue')
plt.show()
Create a pie chart showing the fraction of all Roman Emperors that were assassinated.
Make sure that the pie chart is an even circle, labels the categories, and shows the percentage breakdown of the categories.
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)  # enable offline rendering in the notebook
roman = pd.read_csv('roman-emperor-reigns.csv')
roman.Cause_of_Death.unique()
array(['Possibly assassinated', 'Assassinated', 'Suicide',
'Natural causes', 'Illness', 'Executed', 'Killed in battle',
'Died in captivity'], dtype=object)
assassinated = roman.Cause_of_Death.value_counts()  # full breakdown; the Assassinated slice gives the fraction asked for
fig = {
"data": [
{
"values": list(assassinated.values),
"labels": list(assassinated.index),
"domain": {"x": [0, 1]},
"name": "Cause of death",
"hoverinfo":"label+percent",
"hole": 0,
"type": "pie"
},],
"layout": {
"title":"Cause of death of Roman Emperors",
"annotations": [
{ "font": { "size": 20},
"showarrow": False,
"text": "",
"x": 0.20,
"y": 1
},
]
}
}
iplot(fig)
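If plotly is unavailable, an equivalent static chart is a few lines of matplotlib, which the rest of the notebook already uses. A sketch on hypothetical cause-of-death counts (`axis('equal')` keeps the circle even; `autopct` prints the percentage breakdown):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for scripted runs
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical counts standing in for roman.Cause_of_Death.value_counts().
counts = pd.Series({'Assassinated': 23, 'Natural causes': 20, 'Other': 25})

fig, ax = plt.subplots(figsize=(8, 8))
ax.pie(counts.values, labels=counts.index, autopct='%1.1f%%')
ax.axis('equal')  # draw an even circle, not an ellipse
ax.set_title('Cause of death of Roman Emperors')
```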
Create a scatter plot showing the relationship between the total revenue earned by arcades and the number of Computer Science PhDs awarded in the U.S. between 2000 and 2009.
Don't forget to label your axes! Color each dot according to its year.
df1 = pd.read_csv(r'arcade-revenue-vs-cs-doctorates.csv')
fig = plt.figure(figsize=(12, 9))
plt.scatter(x="Total Arcade Revenue (billions)", y="Computer Science Doctorates Awarded (US)", c=df1.Year, cmap="viridis", data=df1, s=200)
plt.colorbar(label='Year')  # show which color maps to which year
plt.xlabel('Total Arcade Revenue (billions)', fontsize=10, color='blue')
plt.ylabel('Computer Science Doctorates Awarded (US)', fontsize=10, color='blue')
plt.show()